7 research outputs found

    Integrating and visualising primary data from prospective and legacy taxonomic literature

    Specimen data in taxonomic literature are among the highest quality primary biodiversity data. Innovative cybertaxonomic journals are using workflows that maintain data structure and disseminate electronic content to aggregators and other users; such structure is lost in traditional taxonomic publishing. Legacy taxonomic literature is a vast repository of knowledge about biodiversity. Currently, access to that resource is cumbersome, especially for non-specialist data consumers. Markup is a mechanism that makes this content more accessible, and is especially suited to machine analysis. Fine-grained XML (Extensible Markup Language) markup was applied to all (37) open-access articles published in the journal Zootaxa containing treatments on spiders (Order: Araneae). The markup approach was optimized to extract primary specimen data from legacy publications. These data were combined with data from articles containing treatments on spiders published in Biodiversity Data Journal, where XML structure is part of the routine publication process. A series of charts was developed to visualize the content of specimen data in XML-tagged taxonomic treatments, either singly or in aggregate. The data can be filtered by several fields (including journal, taxon, institutional collection, collecting country, collector, author, article and treatment) to query particular aspects of the data. We demonstrate here that XML markup using GoldenGATE can address the challenge presented by unstructured legacy data and can extract structured primary biodiversity data that can be aggregated and jointly queried with data from other Darwin Core-compatible sources, and we show how visualization of these data can communicate key information contained in biodiversity literature. We complement recent studies on aspects of biodiversity knowledge by using XML-structured data to explore 1) the time lag between species discovery and description, and 2) the prevalence of rarity in species descriptions.
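
    To illustrate the kind of structure such markup recovers from legacy text, the sketch below parses a toy tagged treatment with Python's standard library. The element and attribute names are illustrative stand-ins, not the actual GoldenGATE or Darwin Core schema.

```python
# A minimal sketch of extracting specimen records from an XML-tagged
# taxonomic treatment. Element names here are hypothetical placeholders.
import xml.etree.ElementTree as ET

TREATMENT = """
<treatment taxon="Araneus diadematus">
  <materialsCitation>
    <collectingCountry>Germany</collectingCountry>
    <collectorName>A. Collector</collectorName>
    <specimenCount>3</specimenCount>
    <collectionCode>ZFMK</collectionCode>
  </materialsCitation>
</treatment>
"""

def specimen_records(xml_text):
    """Yield one flat dict per materials citation in a treatment."""
    root = ET.fromstring(xml_text)
    taxon = root.get("taxon")
    for citation in root.iter("materialsCitation"):
        record = {"taxon": taxon}
        for field in citation:
            record[field.tag] = field.text
        yield record

for rec in specimen_records(TREATMENT):
    print(rec)
# {'taxon': 'Araneus diadematus', 'collectingCountry': 'Germany', ...}
```

    Records flattened this way can be filtered by country, collector or collection, which is the basis of the filtering and charting described above.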

    Enriched biodiversity data as a resource and service

    Background: Recent years have seen a surge in projects that produce large volumes of structured, machine-readable biodiversity data. To make these data amenable to processing by generic, open source “data enrichment” workflows, they are increasingly being represented in a variety of standards-compliant interchange formats. Here, we report on an initiative in which software developers and taxonomists came together to address the challenges and highlight the opportunities in the enrichment of such biodiversity data by engaging in intensive, collaborative software development: The Biodiversity Data Enrichment Hackathon. Results: The hackathon brought together 37 participants (including developers and taxonomists, i.e. scientific professionals who gather, identify, name and classify species) from 10 countries: Belgium, Bulgaria, Canada, Finland, Germany, Italy, the Netherlands, New Zealand, the UK, and the US. The participants brought expertise in processing structured data, text mining, development of ontologies, digital identification keys, geographic information systems, niche modelling, natural language processing, provenance annotation, semantic integration, taxonomic name resolution, web service interfaces, workflow tools and visualisation. Most use cases and exemplar data were provided by taxonomists. One goal of the meeting was to facilitate re-use and enhancement of biodiversity knowledge by a broad range of stakeholders, such as taxonomists, systematists, ecologists, niche modellers, informaticians and ontologists. The suggested use cases resulted in nine breakout groups addressing three main themes: i) mobilising heritage biodiversity knowledge; ii) formalising and linking concepts; and iii) addressing interoperability between service platforms. Another goal was to further foster a community of experts in biodiversity informatics and to build human links between research projects and institutions, in response to recent calls to further such integration in this research domain. Conclusions: Beyond deriving prototype solutions for each use case, areas of inadequacy were discussed and are being pursued further. It was striking how many possible applications for biodiversity data there were and how quickly solutions could be put together when the normal constraints to collaboration were broken down for a week. Conversely, mobilising biodiversity knowledge from its silos in heritage literature and natural history collections will continue to require formalisation of the concepts (and the links between them) that define the research domain, as well as increased interoperability between the software platforms that operate on these concepts.
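
    One of the enrichment steps named above, taxonomic name resolution, can be sketched against GBIF's public species-match endpoint. The endpoint URL is GBIF's documented API; the wrapper itself is an illustrative assumption, not software produced at the hackathon.

```python
# A hedged sketch of taxonomic name resolution: match a verbatim name
# string against the GBIF backbone taxonomy via its public REST API.
import json
import urllib.parse
import urllib.request

GBIF_MATCH = "https://api.gbif.org/v1/species/match"

def resolve_name(verbatim_name):
    """Return GBIF's best match for a verbatim taxon name string."""
    url = GBIF_MATCH + "?" + urllib.parse.urlencode({"name": verbatim_name})
    with urllib.request.urlopen(url) as resp:
        match = json.load(resp)
    # matchType is NONE when GBIF cannot resolve the string.
    return {
        "query": verbatim_name,
        "scientificName": match.get("scientificName"),
        "matchType": match.get("matchType"),
        "confidence": match.get("confidence"),
    }

print(resolve_name("Araneus diadematus"))
```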

    Community engagement: The ‘last mile’ challenge for European research e-infrastructures

    Europe is building its Open Science Cloud: a set of robust and interoperable e-infrastructures with the capacity to provide data and computational solutions through cloud-based services. The development and sustainable operation of such e-infrastructures are at the forefront of European funding priorities. The research community, however, is still reluctant to engage at the scale required to signal a Europe-wide change in the mode of operation of scientific practices. The striking differences in uptake rates between researchers from different scientific domains indicate that communities do not share the benefits of these European investments equally. We highlight the need to support research communities in organically engaging with the European Open Science Cloud through the development of trustworthy and interoperable Virtual Research Environments. These domain-specific solutions can support communities in gradually bridging technical and socio-cultural gaps between traditional and open digital science practice, better diffusing the benefits of European e-infrastructures.

    Unifying European Biodiversity Informatics (BioUnify)

    In order to preserve the variety of life on Earth, we must understand it better. Biodiversity research is at a pivotal point, with research projects generating data at an ever increasing rate. Structuring, aggregating, linking and processing these data in a meaningful way is a major challenge. The systematic application of information management and engineering technologies in the study of biodiversity (biodiversity informatics) helps transform data into knowledge. However, concerted action by existing e-infrastructures is required to develop and adopt common standards, provide for interoperability, and avoid overlapping functionality. This would result in the unification of the currently fragmented landscape that restricts European biodiversity research from reaching its full potential. The overarching goal of this COST Action is to coordinate existing research and capacity building efforts, through a bottom-up, trans-disciplinary approach, by unifying biodiversity informatics communities across Europe in order to support the long-term vision of modelling biodiversity on Earth. BioUnify will: 1. specify technical requirements, and evaluate and improve models for efficient data and workflow storage, sharing and re-use, within and between different biodiversity communities; 2. mobilise taxonomic, ecological, genomic and biomonitoring data generated and curated by natural history collections, research networks and remote sensing sources in Europe; 3. leverage the results of ongoing biodiversity informatics projects by identifying and developing functional synergies at the individual, group and project levels; 4. raise technical awareness and transfer skills between biodiversity researchers and information technologists; 5. formulate a viable roadmap for achieving the long-term goals of European biodiversity informatics, one that ensures alignment with global activities and translates into efficient biodiversity policy.

    BioVeL: a virtual laboratory for data analysis and modelling in biodiversity science and ecology

    Background: Making forecasts about biodiversity and giving support to policy relies increasingly on large collections of data held electronically, and on substantial computational capability and capacity to analyse, model, simulate and predict using such data. However, the physically distributed nature of data resources and of expertise in advanced analytical tools creates many challenges for the modern scientist. Across the wider biological sciences, presenting such capabilities on the Internet (as "Web services") and using scientific workflow systems to compose them for particular tasks is a practical way to carry out robust "in silico" science. However, use of this approach in biodiversity science and ecology has thus far been quite limited. Results: BioVeL is a virtual laboratory for data analysis and modelling in biodiversity science and ecology, freely accessible via the Internet. BioVeL includes functions for accessing and analysing data through curated Web services; for performing complex in silico analysis through exposure of R programs, workflows, and batch processing functions; for online collaboration through sharing of workflows and workflow runs; for experiment documentation through reproducibility and repeatability; and for computational support via seamless connections to supporting computing infrastructures. We developed and improved more than 60 Web services with significant potential in many different kinds of data analysis and modelling tasks. We composed reusable workflows using these Web services, also incorporating R programs. Deploying these tools into an easy-to-use and accessible 'virtual laboratory', free via the Internet, we applied the workflows in several diverse case studies. We opened the virtual laboratory for public use and, through a programme of external engagement, actively encouraged scientists and third-party application and tool developers to try out the services and contribute to the activity. Conclusions: Our work shows we can deliver an operational, scalable and flexible Internet-based virtual laboratory to meet new demands for data processing and analysis in biodiversity science and ecology. In particular, we have successfully integrated existing and popular tools and practices from different scientific disciplines to be used in biodiversity and ecological research.
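
    The workflow pattern described above can be sketched in miniature: steps (here plain functions standing in for curated Web services or wrapped R programs) are composed in sequence, and a provenance record is kept per step to support the repeatability the abstract mentions. All names below are illustrative assumptions, not the BioVeL API.

```python
# A minimal sketch of composing workflow steps with per-step provenance.
import json
import time

def run_workflow(steps, data):
    """Apply steps in order, logging a provenance record per step."""
    provenance = []
    for step in steps:
        started = time.time()
        data = step(data)
        provenance.append({
            "step": step.__name__,
            "seconds": round(time.time() - started, 3),
            "output_summary": repr(data)[:80],
        })
    return data, provenance

def fetch_occurrences(species):     # stand-in for a data-access service
    return {"species": species, "points": [(52.5, 13.4), (48.9, 2.4)]}

def fit_niche_model(occurrences):   # stand-in for a wrapped R program
    lats = [lat for lat, lon in occurrences["points"]]
    return {"species": occurrences["species"],
            "lat_range": (min(lats), max(lats))}

result, log = run_workflow([fetch_occurrences, fit_niche_model],
                           "Araneus diadematus")
print(json.dumps(log, indent=2))
```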

    Inferring large phylogenies: The big tree problem

    Phylogenetic trees are graph-like structures whose topology describes the inferred pattern of relationships among a set of biological entities, such as species or DNA sequences. Inference of these phylogenies typically involves evaluating large numbers of possible solutions and choosing the optimal topology, or set of topologies, from among all evaluated solutions. Such analyses are computationally intensive, especially when the pattern of relationships among a large number of entities is being sought. This thesis introduces two novel algorithms for the inference of large trees; one is applicable to the likelihood framework, the other to the Bayesian framework. Both approaches rely on the notion of a multi-modal tree ‘landscape’ through which inferential algorithms traverse. Using sampling techniques, the landscape can be perturbed sequentially, so that local optima can be evaded. The algorithms find good solutions in reasonable time, as demonstrated using real and simulated data sets. An example of large phylogeny inference is presented in the form of a novel estimate of Primate phylogeny, the largest estimate for this Order to date. The phylogeny is based on previously published smaller phylogenies, and hence serves as a summary of the present state of Primate phylogeny. In addition to the topology of this ‘supertree’, composite estimates of divergence times are also provided. These estimates are based on multiple, clock-like genes combined using a novel approach presented here. Handling sets of trees and sequences poses practical problems in terms of data conversion and interoperation between computer programs. The thesis therefore concludes with a chapter discussing suitable data structures and programming patterns for phylogenetics. The appendix discusses an implementation of some of these concepts in an object-oriented application programming interface.
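
    The general search idea, climbing a multi-modal landscape and perturbing the current optimum to evade local peaks, can be sketched abstractly. In the thesis the state would be a tree topology and neighbours would come from moves such as nearest-neighbour interchange; the toy score and move set below are stand-ins, not the thesis algorithms.

```python
# A minimal sketch of perturbation-driven hill-climbing on a rugged
# score landscape, illustrating how local optima can be evaded.
import random

def search(initial, neighbours, score, rounds=20, kicks=5):
    """Repeated hill-climbing, restarting from a perturbed optimum."""
    best = current = initial
    for _ in range(rounds):
        # Climb: take the best neighbour until none improves the score.
        while True:
            cand = max(neighbours(current), key=score)
            if score(cand) <= score(current):
                break
            current = cand
        if score(current) > score(best):
            best = current
        # Perturb: several random steps to hop out of the local optimum.
        for _ in range(kicks):
            current = random.choice(neighbours(current))
    return best

# Toy landscape over the integers, in place of a tree landscape.
score = lambda x: -(x % 17) * (x % 13) + x // 50
neighbours = lambda x: [x - 3, x - 1, x + 1, x + 3]
print(search(0, neighbours, score))
```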

    Emerging semantics to link phenotype and environment

    Understanding the interplay between environmental conditions and phenotypes is a fundamental goal of biology. Unfortunately, data that include observations on phenotype and environment are highly heterogeneous and thus difficult to find and integrate. One approach that is likely to improve the status quo involves the use of ontologies to standardize and link data about phenotypes and environments. Specifying and linking data through ontologies will allow researchers to increase the scope and flexibility of large-scale analyses aided by modern computing methods. Investments in this area would advance diverse fields such as ecology, phylogenetics, and conservation biology. While several biological ontologies are well-developed, using them to link phenotypes and environments is rare because of gaps in ontological coverage and limits to interoperability among ontologies and disciplines. In this manuscript, we present (1) use cases from diverse disciplines to illustrate questions that could be answered more efficiently using a robust linkage between phenotypes and environments, (2) two proof-of-concept analyses that show the value of linking phenotypes to environments in fishes and amphibians, and (3) two proposed example data models for linking phenotypes and environments using the Extensible Observation Ontology (OBOE) and the Biological Collections Ontology (BCO); these provide a starting point for the development of a data model linking phenotypes and environments.
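
    The kind of linkage such data models aim for can be sketched as a handful of RDF triples: an observation connecting an organism's phenotype to the environment it was recorded in. The predicate and class URIs below are illustrative placeholders, not the actual OBOE or BCO terms.

```python
# A hedged sketch of a phenotype-environment linkage as RDF triples.
# Requires the third-party rdflib package; all URIs are hypothetical.
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/linkage/")

g = Graph()
obs = EX["observation/1"]
g.add((obs, RDF.type, EX.Observation))
g.add((obs, EX.ofTaxon, Literal("Rana temporaria")))
g.add((obs, EX.hasPhenotype, Literal("dorsal stripe present")))
g.add((obs, EX.inEnvironment, EX["environment/pond"]))
g.add((EX["environment/pond"], EX.waterTemperatureC, Literal(14.5)))

print(g.serialize(format="turtle"))
```

    With real ontology terms in place of the placeholders, graphs like this from different studies could be merged and queried together, which is the integration the use cases above call for.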